sketch out sync codecs + threadpool by d-v-b · Pull Request #3715 · zarr-developers/zarr-python

d-v-b · 2026-02-18T20:51:17Z

This is a work in progress with all the heavy lifting done by claude. The goal is to improve the performance of our codecs by avoiding overhead in to_thread and other async machinery. At the moment we have deadlocks in some of the array tests, but I am opening this now as a draft to see if the benchmarks show anything promising.

codspeed-hq · 2026-02-18T21:12:53Z

Merging this PR will improve performance by ×5

⚡ 50 improved benchmarks
✅ 6 untouched benchmarks
⏩ 6 skipped benchmarks¹

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
⚡	WallTime	`test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-gzip]`	1,031.6 ms	270.8 ms	×3.8
⚡	WallTime	`test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-None]`	554.3 ms	181.7 ms	×3.1
⚡	WallTime	`test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-None]`	1,551.5 ms	684.4 ms	×2.3
⚡	WallTime	`test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-gzip]`	2,111.7 ms	791.6 ms	×2.7
⚡	WallTime	`test_write_array[memory-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-None]`	5.5 s	1.8 s	×3.1
⚡	WallTime	`test_write_array[memory-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-gzip]`	9.7 s	2.6 s	×3.7
⚡	WallTime	`test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=None)-None]`	1,204.9 ms	552.4 ms	×2.2
⚡	WallTime	`test_write_array[local-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-None]`	5.5 s	1.8 s	×3.1
⚡	WallTime	`test_write_array[local-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-gzip]`	9.7 s	2.6 s	×3.7
⚡	WallTime	`test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-None]`	2.7 s	1.3 s	×2
⚡	WallTime	`test_slice_indexing[(50, 50, 50)-(slice(None, 10, None), slice(None, 10, None), slice(None, 10, None))-memory]`	1,831.3 µs	662.2 µs	×2.8
⚡	WallTime	`test_read_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-None]`	278.1 ms	66.7 ms	×4.2
⚡	WallTime	`test_read_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-gzip]`	1,315 ms	532.1 ms	×2.5
⚡	WallTime	`test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=None)-gzip]`	1,631.2 ms	639.4 ms	×2.6
⚡	WallTime	`test_read_array[memory-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-gzip]`	6 s	1.4 s	×4.2
⚡	WallTime	`test_read_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=None)-None]`	619.7 ms	143.9 ms	×4.3
⚡	WallTime	`test_read_array[memory-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-None]`	2,886.8 ms	604.5 ms	×4.8
⚡	WallTime	`test_slice_indexing[(50, 50, 50)-(slice(None, None, None), slice(None, None, None), slice(None, None, None))-memory]`	419.2 ms	99.4 ms	×4.2
⚡	WallTime	`test_read_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=None)-gzip]`	952.5 ms	228.6 ms	×4.2
⚡	WallTime	`test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-gzip]`	3.2 s	1.5 s	×2.2
...	...	...	...	...	...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

_{Comparing d-v-b:perf/faster-codecs (9d77ca5) with main (f8b3d38)}

6 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩

…thon into perf/faster-codecs

d-v-b · 2026-02-19T10:53:00Z

@@ -0,0 +1,228 @@
+# Design: Fully Synchronous Read/Write Bypass


@rabernat @dcherian have a look, this is claude's summary of the perf blockers addressed in this PR

d-v-b · 2026-02-19T12:49:49Z

performance impact ranges from "good" to "amazing" so I think we want to learn from this PR. IMO this is NOT a merge candidate but rather should function as a proof-of-concept for what we can get if we rethink our current codec API.

Some key points:

Wrapping CPU-bound routines like gzip encode / decode with async adds needless latency. We get a lot of perf by using a sync fast path whenever possible. We need to bake this "sync is faster when available" lesson into both our codec API and store API. For example, there is no reason that reading or writing to an in-memor dict should be async.
We should design the chunk encoding process so that IO bound and CPU-bound routines are logically separated in the codebase. That means modelling sharding as a codec is probably wrong. Sharding is declared as a codec in array metadata, but we don't need to model it as a codec internally. Sharding changes how we do IO, but it should not change when we do IO.
I haven't looked at memory use at all. that's probably a separate effort.

d-v-b · 2026-02-19T13:06:24Z

the current performance improvements are without any parallelism. I'm adding that now.

d-v-b · 2026-02-19T13:53:27Z

the latest commit adds thread-based parallelism to the synchronous codec pipeline. we compute an estimated compute cost based on the chunk size, codecs, and operation (encode / code), and use that estimate to choose a parellelism strategy, ranging from no threads to full use of a thread pool.

d-v-b · 2026-02-20T15:20:17Z

marking this as not a draft, because I think we should actually consider merging it.

…ospection more efficient

…into perf/faster-codecs

mkitti

Could we adjust work estimates based on codec parameters?

dcherian · 2026-02-20T20:51:09Z

+_MIN_CHUNK_NBYTES_FOR_POOL = 100_000  # 100 KB
+
+
+def _choose_workers(n_chunks: int, chunk_nbytes: int, codecs: Iterable[Codec]) -> int:


Can this be def _use_thread_pool(...)->bool instead?

dcherian · 2026-02-20T20:51:31Z

-
-def _get_pool(max_workers: int) -> ThreadPoolExecutor:
-    """Get a thread pool with at most *max_workers* threads."""
+def _get_pool() -> ThreadPoolExecutor:


hard to see why this had to change but... i"m not opposed to it.

…into perf/faster-codecs

d-v-b · 2026-02-23T12:40:25Z

The changes here improve performance a lot, but I think we can do even better with a more comprehensive set of changes. I had claude cook up a planning document based on zarrs and tensorstore here: https://hackmd.io/A933wEUwQjOx8rJmmWo13A. Please review that document. I will use this plan to guide the next round of performance improvements. Not sure if they will be in this PR or a subsequent one.

dcherian · 2026-02-24T17:33:56Z

    assert get_pipeline_class().__name__ != ""

-    config.set({"codec_pipeline.name": "zarr.core.codec_pipeline.BatchedCodecPipeline"})
+    config.set({"codec_pipeline.path": "zarr.core.codec_pipeline.BatchedCodecPipeline"})


what's this about?

dcherian · 2026-02-24T17:35:03Z

+    # _open() from a sync context, so we replicate its logic here.
+    # -------------------------------------------------------------------
+
+    def get_sync(


are we able to share sync/async code paths at all?

dcherian · 2026-02-24T17:36:05Z

+
+# Minimum chunk size (in bytes) to consider using the thread pool.
+# Below this, per-chunk codec work is too small to offset dispatch overhead.
+_MIN_CHUNK_NBYTES_FOR_POOL = 100_000  # 100 KB


let's make this a config, so its easy to experiment with

dcherian · 2026-02-24T17:50:29Z

+        if self._all_sync:
+            # Streaming per-chunk pipeline: each chunk flows through
+            # read_existing → decode → merge → encode → write as a single
+            # task. Running N tasks concurrently overlaps IO with compute.
+            async def _write_chunk(
+                byte_setter: ByteSetter,


why is there an async func under self._all_sync? Seems like a naming issue that is very confusing to me right now

we send this function into concurrent_map, which is why it needs to be async

d-v-b · 2026-04-08T17:57:17Z

superseded by #3885

sketch out sync codecs + threadpool

f427898

github-actions Bot added the needs release notes Automatically applied to PRs which haven't added release notes label Feb 18, 2026

d-v-b added benchmark Code will be benchmarked in a CI job. and removed needs release notes Automatically applied to PRs which haven't added release notes labels Feb 18, 2026

Merge branch 'main' into perf/faster-codecs

dbdc3d4

github-actions Bot added the needs release notes Automatically applied to PRs which haven't added release notes label Feb 18, 2026

d-v-b added 5 commits February 19, 2026 08:45

fix perf regressions

65d1230

Merge branch 'perf/faster-codecs' of https://github.com/d-v-b/zarr-py…

e24fe7e

…thon into perf/faster-codecs

add partial encode / decode

f979eaa

add sync hotpath

a934899

add comments and documentation

b53ac3e

d-v-b commented Feb 19, 2026

View reviewed changes

d-v-b added 4 commits February 19, 2026 12:29

refactor sharding to allow sync

73ac845

fix array spec propagation

aeecda8

fix countingdict tests

69172fb

update design doc

28d0def

dynamic pool allocation

f8e39e6

d-v-b added 7 commits February 19, 2026 15:03

default to 1 itemsize for data types that don't declare it

b388911

Merge branch 'main' into perf/faster-codecs

7e29ef3

Merge branch 'main' into perf/faster-codecs

00dde0b

remove extra codec pipeline

9d77ca5

remove garbage

88a4875

lint

284e5e2

use protocols for new sync behavior

b1b876a

d-v-b marked this pull request as ready for review February 20, 2026 15:19

d-v-b commented Feb 20, 2026

View reviewed changes

Comment thread src/zarr/core/codec_pipeline.py Outdated

d-v-b added 2 commits February 20, 2026 21:09

fix type hints, prevent thread pool leakage, make codec pipeline intr…

01e1f73

…ospection more efficient

Merge branch 'main' of https://github.com/zarr-developers/zarr-python …

fbde3af

…into perf/faster-codecs